Visualization and description

Histogram

  • Use: visualizing the distribution of interval variables.

  • Divide data into equally sized “bins” and count the number in each. The height of each bar indicates the number of values in that bin.
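
As a rough sketch in R, `hist()` handles both the binning and the counting. The data below are simulated purely for illustration (they are not from the lecture), and `breaks = 10` is only a suggestion that R rounds to convenient bin edges.

```r
# Simulated interval data (an assumption for illustration only)
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)

# Compute the bins without drawing, to inspect the counts behind the bars
h <- hist(x, breaks = 10, plot = FALSE)
h$breaks   # bin edges (equal widths)
h$counts   # number of observations in each bin

# Draw the histogram itself
hist(x, breaks = 10, main = "Histogram of x", xlab = "x")
```

The bar heights in the plot are exactly the values in `h$counts`.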

Density Plot

  • Use: visualizing the distribution of interval variables

  • Sort of a “smoothed” version of the histogram.

  • The total area under the curve equals 1, and the height indicates how concentrated the data are in that region.
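
A minimal sketch of the "area equals 1" property, assuming simulated data and R's default kernel density settings. The area is approximated numerically, so it comes out very close to 1 rather than exactly 1.

```r
# Simulated interval data (an assumption for illustration only)
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)

d <- density(x)   # kernel density estimate with default settings

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area

plot(d, main = "Density of x")
```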

Density Plot comparison

Box plot

  • Use: visualizing the distribution of interval variables.

  • Shows the “five number summary”: the minimum, first quartile, median, third quartile, and maximum.

  • Especially useful for making comparisons across groups or describing multiple items with similar scales.
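
One way to see the numbers behind a box plot is `fivenum()`. This is only a sketch: the group labels and scores are hypothetical, and `fivenum()` reports Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
x <- c(1, 3, 3, 6, 7, 8, 9, 11)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(x)

# Comparing groups: one box per group (group labels are hypothetical)
scores <- c(1, 3, 3, 6, 7, 8, 9, 11, 4, 5, 5, 6, 6, 7, 7, 8)
group  <- rep(c("A", "B"), each = 8)
boxplot(scores ~ group, xlab = "Group", ylab = "Score")
```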

Bar plot

  • Use: visualizing the distribution of categorical variables

  • Count the frequency (or proportion) of observations in each group
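
A short sketch of the counting step, using a made-up categorical variable (the `region` values are an assumption for illustration). `table()` gives frequencies and `prop.table()` converts them to proportions.

```r
# Hypothetical categorical variable
region <- c("North", "South", "West", "North", "South", "North", "West", "North")

counts <- table(region)        # frequency of each category
props  <- prop.table(counts)   # proportion of each category

counts
props

barplot(counts, ylab = "Frequency")
```

Passing `props` to `barplot()` instead would plot proportions rather than counts.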

Mosaic plots

  • Another option for visualizing categorical data, or multiple categories together.

  • The size of each block indicates the number of observations in that group.
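
As an illustrative sketch, base R's `mosaicplot()` draws a mosaic plot from a cross-tabulation. The two variables below are hypothetical data invented for this example.

```r
# Hypothetical data: two categorical variables for 10 respondents
region <- c("North", "North", "South", "South", "South",
            "North", "South", "North", "South", "South")
voted  <- c("Yes", "No", "Yes", "Yes", "No",
            "Yes", "Yes", "No", "Yes", "No")

tab <- table(region, voted)   # cross-tabulated counts
tab

# Each tile's area is proportional to the count in that cell
mosaicplot(tab, main = "Turnout by region")
```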

Describing Data

In addition to visualization, we generally want to be able to summarize and compare characteristics like:

  • Central Tendency: “typical values” of the variable

  • Dispersion: the amount of spread around the central tendency

  • Modality: the number of “peaks” or “modes” in a distribution.

  • Skewness: the amount of asymmetry in a variable.

Measures of Central Tendency

(some things you probably remember from school)

(Arithmetic) Mean

Sum up all the numbers and divide by the total number of observations

\[ \bar{x} = \frac{1}{n}\sum^n_{i=1}x_i \]

\[ \bar{x} = \text{the mean of x} \]

\[ x_i = \text{the individual values of x} \]

\[ n = \text{the number of observations} \]
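
The formula translates almost directly into R. This is a small sketch with four example values; the variable names `n` and `x_bar` are chosen here to mirror the notation.

```r
x <- c(3, 4, 6, 11)

n     <- length(x)     # number of observations
x_bar <- sum(x) / n    # sum the values, then divide by n

x_bar
mean(x)                # built-in equivalent
```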

(Arithmetic) Mean

A useful feature of the mean: the deviations from the mean always sum to zero, \(\sum_{i=1}^n (x_i - \bar{x}) = 0\).

\(x\)            \(x-\bar{x}\)
3                \(3 - 6 = -3\)
4                \(4 - 6 = -2\)
6                \(6 - 6 = 0\)
11               \(11 - 6 = 5\)
Total: \(24\)    Total: \(-3 + -2 + 0 + 5 = 0\)
Mean: \(24/4 = 6\)
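
The same check can be sketched in R. With non-integer data the sum may come out as a tiny floating-point number rather than exactly zero, so comparing with `all.equal()` is the safer habit.

```r
x <- c(3, 4, 6, 11)

deviations <- x - mean(x)   # each value minus the mean
deviations
sum(deviations)             # the deviations cancel out
```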

(Arithmetic) Mean

A problematic feature of the mean is that it’s sensitive to extreme outliers, which are typical of skewed data.

The average height of Victor Wembanyama (7 ft 4) and a bunch of regular people is probably misleading.

Median

For an odd number of observations, the median is the middle number:

\[ x = 1, 3, 3, 6, 7, 8, 9 \]

\[ \text{Median} = 6 \]

For an even number of observations, the median is the mean of the two middle values:

\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]

\[ \text{Median} = 6.5 \]
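
To make the rule concrete, here is a sketch of a hand-written median. `median_by_hand` is a hypothetical helper written for this example; in practice the built-in `median()` does the same job.

```r
# Hypothetical helper that spells out the rule; median() is the built-in
median_by_hand <- function(v) {
  v <- sort(v)
  n <- length(v)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]                 # odd n: the single middle value
  } else {
    mean(v[c(n / 2, n / 2 + 1)])   # even n: mean of the two middle values
  }
}

median_by_hand(c(1, 3, 3, 6, 7, 8, 9))       # 7 values
median_by_hand(c(1, 3, 3, 6, 7, 8, 9, 11))   # 8 values
```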

Median

Importantly, the median is a skew-robust measure of central tendency.

The mean and median will be similar if there’s no skew:

\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]

\[ \text{Median of x} = 6.5 \]

\[ \text{Mean of x} = 6 \]

Median

But they diverge when we include extreme outliers:

\[ x = 1, 3, 3, 6, 7, 8, 9, 100000000000 \]

\[ \text{Median of x} = 6.5 \]

\[ \text{Mean of x} \approx 12500000005 \]
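
A quick sketch comparing the two measures with and without the outlier. The name `x_out` is just a label chosen for this example.

```r
x     <- c(1, 3, 3, 6, 7, 8, 9, 11)
x_out <- c(1, 3, 3, 6, 7, 8, 9, 100000000000)   # last value replaced by an outlier

c(mean = mean(x),     median = median(x))
c(mean = mean(x_out), median = median(x_out))   # median unchanged, mean explodes
```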

Mode

The modal value is the value that occurs most often.

\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]

\[ \text{Mode of x} = 3 \]

Unlike the mean, the mode is a valid measure of central tendency for nominal variables:

\[ \text{Tom, Earl, Tom, Sarah, Beth} \]

\[ \text{Mode} = \text{Tom} \]
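
Base R has no built-in function for the statistical mode (its `mode()` reports how an object is stored instead), so the sketch below uses a hypothetical helper, `stat_mode`, built on `table()`. It returns the result as a character string and reports every value tied for first place.

```r
# Hypothetical helper: returns every value tied for the highest count
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[counts == max(counts)]
}

stat_mode(c(1, 3, 3, 6, 7, 8, 9, 11))
stat_mode(c("Tom", "Earl", "Tom", "Sarah", "Beth"))
```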

Modality

Variables may have more than one modal value. For instance, the cross-national distribution of male average years of schooling is roughly bimodal.

Modality

By contrast, the percentage of a country’s population that is working-age is unimodal: most countries fall between 65 and 70%, and no other value is nearly as common.

Central Tendency

          Nominal    Ordinal    Interval
Mean      No         No         Yes
Median    No         Yes        Yes
Mode      Yes        Yes        Yes

Measures of Dispersion

Standard Deviation

The standard deviation is a … standard measure of dispersion for interval variables based on squared deviations from the mean.

Standard Deviation

A larger standard deviation, all else equal, indicates that observations tend to deviate from the mean more.

Standard Deviation: Steps

To calculate the standard deviation for a sample:

1. Calculate \(\bar{x}\) (the mean of \(x\))
2. Calculate the residual (\(x_i - \bar{x}\)) for each value
3. Square each residual and sum.
4. Calculate the variance by dividing this total by the number of observations minus 1 (\(n - 1\))
5. Calculate the standard deviation by taking the square root of the variance.

x           Deviation from mean (5)    Difference squared
2           -3                         9
4           -1                         1
4           -1                         1
4           -1                         1
5           0                          0
5           0                          0
5           0                          0
7           2                          4
9           4                          16
Mean = 5    Total = 0                  TSS = 32

\[ \text{Var}(x) = \frac{32}{9-1} = 4 \]

\[ s_x = \sqrt{4} = 2 \]
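
The calculation above can be sketched step by step in R. The variable names are chosen for this example, and the result is compared against the built-in `sd()` as a sanity check.

```r
x <- c(2, 4, 4, 4, 5, 5, 5, 7, 9)

x_bar     <- mean(x)                 # the mean
deviation <- x - x_bar               # deviation of each value from the mean
tss       <- sum(deviation^2)        # square each deviation and sum
variance  <- tss / (length(x) - 1)   # divide by n - 1
s_x       <- sqrt(variance)          # square root of the variance

c(mean = x_bar, tss = tss, variance = variance, sd = s_x)
all.equal(s_x, sd(x))                # agrees with the built-in
```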

Standard Deviation

Fortunately, we don’t have to do this by hand:

x <- c(2, 4, 4, 4, 5, 5, 5, 7, 9)

# sample standard deviation (divides by n - 1)
sd(x)
[1] 2

The key thing to remember is just that the standard deviation is sort of like “an average of differences from the average.”

Notation

Mean and standard deviation will come up a lot.

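In formulas, you’ll often see the standard deviation represented by the Greek letter \(\sigma\) (for a population) or by \(s\) (for a sample, as above).

The mean is often represented by the Greek letter \(\mu\) (for a population) or by \(\bar{x}\) (for a sample).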

Range and IQR

  • Range is simply the difference between the lowest and highest values.

  • Interquartile Range (IQR) is the difference between the 75th and 25th percentiles of a variable (the third and first quartiles), which corresponds to the box part of a box-and-whiskers plot.
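
Both measures are one-liners in R. A small sketch: note that R supports several definitions of sample quantiles, and the values below come from its default (type 7), so other software may report slightly different quartiles.

```r
x <- c(2, 4, 4, 4, 5, 5, 5, 7, 9)

range(x)                     # lowest and highest value
diff(range(x))               # the range as a single number
quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles
IQR(x)                       # difference between those two percentiles
```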

Dispersion

                      Nominal    Ordinal    Interval
Standard Deviation    No         No         Yes
IQR                   No         Yes        Yes

Skew

Skew refers to the degree of asymmetry in data.

No Skew

When the distribution is basically symmetric, the mean and the median essentially overlap.

Right Skew

With right skew, extreme high values pull the mean higher than the median.

Left Skew

With left skew, extreme low values pull the mean lower than the median.

Skewness

  • There are multiple measures of skewness, but the one you’re most likely to encounter is Fisher’s moment coefficient of skewness.

  • In practice, though, checking the difference between the mean and the median is a good way to identify skew.
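
A sketch of both approaches, assuming the simplest (population) form of the moment coefficient: the third central moment divided by the second central moment raised to the power 1.5. `moment_skew` is a hypothetical helper written for this example, and add-on packages may apply small-sample corrections that give slightly different numbers.

```r
# Hypothetical helper: third central moment over the 1.5 power of the second
moment_skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

symmetric <- c(1, 2, 3, 4, 5)
right     <- c(1, 3, 3, 6, 7, 8, 9, 100)

moment_skew(symmetric)        # 0 for perfectly symmetric data
moment_skew(right)            # positive: right skew
mean(right) - median(right)   # also positive: mean pulled above the median
```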

Common transformations

Centering and standardizing

  • Centering a variable means subtracting that variable’s mean from each individual observation, giving it a mean of 0.

  • Scaling a variable means dividing each observation by that variable’s standard deviation (a measure of variability), giving it a standard deviation of 1.

Centering and standardizing

Centering and standardizing (also called z-scaling) can make it feasible to compare two variables with different scales.
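
A minimal sketch of the two steps done by hand, followed by base R's `scale()`, which performs both at once. The variable names are chosen for this example.

```r
x <- c(2, 4, 4, 4, 5, 5, 5, 7, 9)

centered <- x - mean(x)        # mean becomes 0
z        <- centered / sd(x)   # standard deviation becomes 1

z
c(mean = mean(z), sd = sd(z))

# scale() does both steps; it returns a one-column matrix
as.numeric(scale(x))
```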

Centering and standardizing

Z standardization gives us a way to compare these variables on similar scales:

Log transformations

Log transformations are often used on variables that take only positive values (the logarithm is undefined for zero and negative numbers).

Remember that \(\log_b(x)\) is the power to which \(b\) must be raised to equal \(x\). So \(\log_{10}(100) = 2\) because \(10^2 = 10 \times 10 = 100\).

And \(\log_{10}(1000) = 3\), \(\log_{10}(10000) = 4\) and so on…
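
These identities are easy to check in R. One thing worth knowing for this sketch: plain `log()` uses the natural logarithm (base e) unless a `base` is supplied.

```r
log10(100)         # 2, because 10^2 = 100
log10(1000)        # 3
log(8, base = 2)   # 3, because 2^3 = 8
log(100)           # natural log (base e) is the default
10^log10(100)      # raising the base to the log recovers the original value
```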

Log transformations

Log transformations are often used to compress skewed distributions:
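
A rough illustration with simulated data: `income` below is a made-up, right-skewed variable drawn from a log-normal distribution (an assumption chosen because its log is symmetric by construction). Before the transformation the mean sits well above the median; afterwards the two nearly coincide.

```r
# Simulated right-skewed variable (an assumption for illustration only)
set.seed(123)
income     <- rlnorm(1000, meanlog = 10, sdlog = 1)
log_income <- log(income)

c(mean = mean(income),     median = median(income))       # mean well above median
c(mean = mean(log_income), median = median(log_income))   # nearly equal

hist(income,     main = "Raw values")
hist(log_income, main = "Logged values")
```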

Log transformations

They can also make non-linear relationships into linear relationships, which can make them easier to work with graphically and mathematically:
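
As a sketch of the idea, take a hypothetical power-law relationship \(y = 2x^3\) (made up for this example, with no noise). Taking logs of both sides gives \(\log(y) = \log(2) + 3\log(x)\), which is a straight line in \(\log(x)\).

```r
# Hypothetical power-law relationship: y = 2 * x^3 (no noise, for clarity)
x <- 1:20
y <- 2 * x^3

# A straight-line fit on the log-log scale recovers the exponent
fit <- lm(log(y) ~ log(x))
coef(fit)   # intercept near log(2), slope near 3

plot(x, y, main = "Original scale")             # curved
plot(log(x), log(y), main = "Log-log scale")    # straight line
```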

R code

https://tinyurl.com/aveukcha